
cnnc's Introduction

CNNC

Title: Deep learning for inferring gene relationships from single-cell expression data

CNNC originally stood for Convolutional Neural Network Co-expression analysis. Co-expression is only one of the tasks CNNC can perform, but the name derived from it has stuck.

See https://www.pnas.org/content/early/2019/12/09/1911536116 for details.

date: 2019-07-07

1, CNNC

CNNC aims to infer gene-gene relationships from single-cell expression data. For each gene pair, scRNA-seq expression levels are transformed into a 32×32 normalized empirical probability distribution function (NEPDF) matrix. The NEPDF serves as the input to a convolutional neural network (CNN). The intermediate layer of the CNN can be further concatenated with input vectors representing DNase-seq and PWM data. The output layer can have one, three, or more nodes, depending on the application. For example, for causality inference the output layer contains three probability nodes, where p0 represents the probability that genes a and b are not interacting, p1 encodes the case that gene a regulates gene b, and p2 is the probability that gene b regulates gene a.
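For intuition only, here is a minimal sketch (not the repository script) of how a single gene pair's NEPDF could be computed with NumPy. The function name and arguments are hypothetical; the log-scaling constants follow the transformation quoted in the issues below.

import numpy as np

def nepdf_for_pair(expr_a, expr_b, bins=32):
    # expr_a, expr_b: expression vectors of gene a and gene b over the same cells
    H, _, _ = np.histogram2d(expr_a, expr_b, bins=bins)
    # normalize counts to an empirical joint probability, then compress the
    # dynamic range: log10 values in roughly [-4, 0] are mapped to [0, 1]
    return (np.log10(H / len(expr_a) + 1e-4) + 4) / 4

# example: nepdf = nepdf_for_pair(np.random.rand(1000), np.random.rand(1000))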

2, Pipelines

(a) Pipeline for TF-target, KEGG, and Reactome edge prediction. Users only need to provide a gene-pair candidate list. TF-target prediction is cell-type specific; here we provide the model for mESC TF prediction. Please use the mESC expression data to generate the mESC subset, then perform the NEPDF generation and train/test classification steps below, and use the large scRNA-seq and bulk data for the pathway tasks.

(b) Pipeline for a new task with the expression data we collected. Users need to provide a gene-pair candidate list to generate the NEPDF list, plus a label list, to train and test the model.

(c) Pipeline for a new task with expression data users collect themselves. Users need to provide a gene-pair candidate list, their own expression data to generate the NEPDF list, and a label list to train and test the model.

3, Data sources

3.1 scRNA-seq:

https://s3.amazonaws.com/mousescexpression/rank_total_gene_rpkm.h5

3.2 bone marrow derived macrophage scRNA-seq:

https://mousescexpression.s3.amazonaws.com/bone_marrow_cell.h5

3.3 bulk RNA-seq:

https://s3.us-east-2.amazonaws.com/mousebulkexprssion/mouse_bulk.h5

3.4 mESC single-cell RNA-seq:

https://mousescexpression.s3.amazonaws.com/mesc_cell.h5

3.5 dendritic cell single-cell RNA-seq:

https://mousescexpression.s3.amazonaws.com/dendritic_cell.h5

4, Code environment

Users need to install Python and all the modules required by the code.

The authors' environment is Python 3.6.3 on a Linux server running CentOS 6.5 as the underlying OS, with Rocks 6.1.1 for cluster management.

The authors use Theano as the Keras backend.

The authors' GPU is a GeForce GTX 1080. If the latest Theano does not work, please try an older version.

Although not strictly necessary, we strongly recommend GPU acceleration and conda for package, dependency, and environment management to save time. With conda, the total installation time for the required Python packages should be less than one hour.
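As an illustration only (the repository does not ship an environment file), a setup along the following lines should work; the exact package list and versions are assumptions based on the description above (Python 3.6, Keras with the Theano backend):

conda create -n cnnc python=3.6
conda activate cnnc
pip install numpy scipy pandas matplotlib h5py theano keras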

5, Trained models

(see the trained_models folder for details)

5.1 KEGG Pathway prediction model

5.2 Reactome Pathway prediction model

5.3 GTRD bone TF prediction model

6, Train model for a new task

Users can define their own tasks by providing new expression data and/or new gene pair labels.

7, Command lines for the trained models

7.1 Step 1: users need to provide a gene pair list.

gene_pair_list is the file that contains gene pairs and (optionally) their labels. Format: 'GeneA GeneB' or 'GeneA GeneB 0', as in mmukegg_new_new_unique_rand_labelx_sy.txt and mmukegg_new_new_unique_rand_labelx.txt in the data folder. Users also need to provide a data_separation_index_list, which is a list of numbers dividing gene_pair_list into smaller parts.

Here we use the data separation index list to divide the gene pairs into smaller data parts, making sure that the gene pairs in each index interval are completely isolated from the others, so that CNNC's performance can be evaluated on just a small data part. If users do not want a specific data separation, they can simply generate an index list that divides the data into N equal parts, as in the sketch below.
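A minimal sketch for writing such an index list in Python, assuming (as the example files suggest) that the file is simply a column of cumulative row boundaries starting at 0 and ending at the total number of gene pairs; check your own files before relying on this format:

import numpy as np

def write_separation_index(gene_pair_file, index_file, n_parts):
    # count the gene pairs, then emit n_parts + 1 equally spaced boundaries
    with open(gene_pair_file) as f:
        total = sum(1 for line in f if line.strip())
    boundaries = np.linspace(0, total, n_parts + 1).astype(int)
    with open(index_file, 'w') as out:
        out.write('\n'.join(str(b) for b in boundaries) + '\n')

# example: write_separation_index('data/mmukegg_new_new_unique_rand_labelx.txt', 'my_index_list.txt', 9)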

7.2 Step 2: use get_xy_label_data_cnn_combine_from_database.py to generate the gene pair NEPDF list.

Usage: python get_xy_label_data_cnn_combine_from_database.py bulk_gene_list.txt sc_gene_list.txt gene_pair_list data_separation_index_list bulk_expression_data sc_expression_data 0

Command line on the authors' Linux machine:

python get_xy_label_data_cnn_combine_from_database.py /home/yey/CNNC-master/data/bulk_gene_list.txt /home/yey/CNNC-master/data/sc_gene_list.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx_num_sy.txt /home/yey/sc_process_1/new_bulk_mouse/prs_calculation/mouse_bulk.h5 /home/yey/sc_process_1/rank_total_gene_rpkm.h5 0 

#################INPUT################################################################################################################################

#1, bulk_gene_list.txt is the file that maps the bulk expression data gene set to gene symbol IDs. Format: 'gene symbol ID\tbulk gene ID' (a minimal parsing sketch appears after this input list). Set it to None if you do not have it.

#2, sc_gene_list.txt is the file that maps the sc expression data gene set to gene symbol IDs. Format: 'gene symbol ID\tsc gene ID'. Please note that the mESC single-cell data is not from the big 40k dataset, so we provide a separate sc_gene_list.txt file for it. Set it to None if you do not have it.

#3, gene_pair_list is the file that contains gene pairs and their labels. Format: 'GeneA GeneB'.

#4, data_separation_index_list is a number list that divides gene_pair_list into small parts.

#Here we use the data separation index list to divide the gene pairs into small data parts and to make sure that the gene pairs in each index interval are completely isolated from the others, so that CNNC's performance can be evaluated on just a small data part.

#If users do not need to separate the data, they can simply generate an index list that divides the data into N equal parts.

#5, bulk_expression_data should be in HDF5 format. Users can use their own data or the data we provide. Set it to None if you do not have it.

#6, sc_expression_data should be in HDF5 format. Users can use their own data or the data we provide. Set it to None if you do not have it.

#7, flag: 0 means do not generate a label list; 1 means generate a label list.

If users do not have bulk (single-cell) data, just pass None for bulk_gene_list.txt (sc_gene_list.txt) and bulk_expression_data (sc_expression_data).
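As a minimal illustration of the mapping-file format described in #1 and #2 above (the helper name is hypothetical, and the two-column whitespace-separated layout is assumed from the format strings):

def load_gene_map(list_file):
    # each non-empty line: gene symbol ID, then the expression-data gene ID
    symbol_to_id = {}
    with open(list_file) as f:
        for line in f:
            fields = line.strip().split()
            if len(fields) >= 2:
                symbol_to_id[fields[0]] = fields[1]
    return symbol_to_id

# example: bulk_map = load_gene_map('data/bulk_gene_list.txt')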

#################OUTPUT

It generates a NEPDF_data folder and a series of data files containing Nxdata_tf (NEPDF file) and zdata_tf (gene symbol pair file) for each divided data part.

Here we use gene symbol information to align the bulk, scRNA-seq, and gene pair gene sets. In our own data, the scRNA-seq used Entrez IDs, the bulk RNA-seq used Ensembl IDs, and the gene pair list used gene symbols, so we used bulk_gene_list.txt and sc_gene_list.txt to convert all the IDs to gene symbols. Please also perform this ID conversion for your bulk and scRNA-seq data if you want to use your own expression data.
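A quick way to inspect the generated parts; the exact file names below (e.g. 'Nxdata_tf0.npy') are an assumption, so list the NEPDF_data folder to confirm how the parts are named on your system:

import numpy as np

nepdf = np.load('NEPDF_data/Nxdata_tf0.npy', allow_pickle=True)   # NEPDF matrices for part 0
pairs = np.load('NEPDF_data/zdata_tf0.npy', allow_pickle=True)    # gene symbol pairs for part 0
# nepdf.shape is roughly (n_pairs, 32, 32) with sc data only,
# or (n_pairs, 64, 32) when bulk and sc data are combined (see the issues below)
print(nepdf.shape, len(pairs))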

7.3 Step 3: use predict_no_y.py to make predictions.

Usage: python predict_no_y.py number_of_separation NEPDF_path number_of_categories model_path

Command line on the authors' Linux machine:

python predict_no_y.py  9 /home/yey/CNNC-master/NEPDF_data  3 /home/yey/CNNC-master/trained_models/KEGG_keras_cnn_trained_model_shallow.h5

(The trained_models folder contains the trained models for the KEGG and Reactome databases, respectively.)
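Once the script finishes, the predicted probabilities can be examined as in the sketch below; the output directory name and the column ordering are assumptions (they follow the discussion in the issues below), so verify them against your own run:

import numpy as np

y_predict = np.load('predict_results_no_y_1/y_predict.npy')  # shape (n_pairs, 3) for three categories
# for the causality task the three columns are assumed to correspond to p0 (no interaction),
# p1 (gene a regulates gene b) and p2 (gene b regulates gene a), as described in section 1
best_label = y_predict.argmax(axis=1)
print(y_predict[:5], best_label[:5])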

8, Command lines for training a new model

8.1 Step 1: users need to provide a gene pair candidate list (the same as Step 1 for the trained models, except for the flag setting).

gene_pair_list is the file that contains gene pairs and their labels. Format: 'GeneA GeneB 0', as in mmukegg_new_new_unique_rand_labelx_sy.txt and mmukegg_new_new_unique_rand_labelx.txt in the data folder. Users also need to provide a data_separation_index_list, which is a list of numbers dividing gene_pair_list into smaller parts.

Here we use the data separation index list to divide the gene pairs into smaller data parts, making sure that the gene pairs in each index interval are completely isolated from the others, so that CNNC's performance can be evaluated on just a small data part.

If users do not need to separate the data, they can simply generate an index list that divides the data into N equal parts.

8.2 Step 2: use get_xy_label_data_cnn_combine_from_database.py to generate the gene pair NEPDF list and their labels (the same as Step 2 for the trained models, except for the flag setting).

Usage: python get_xy_label_data_cnn_combine_from_database.py bulk_gene_list.txt sc_gene_list.txt gene_pair_list data_separation_index_list bulk_expression_data sc_expression_data 1

Command line on the authors' Linux machine:

python get_xy_label_data_cnn_combine_from_database.py /home/yey/CNNC-master/data/bulk_gene_list.txt /home/yey/CNNC-master/data/sc_gene_list.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt /home/yey/CNNC-master/data/mmukegg_new_new_unique_rand_labelx_num_sy.txt /home/yey/sc_process_1/new_bulk_mouse/prs_calculation/mouse_bulk.h5 /home/yey/sc_process_1/rank_total_gene_rpkm.h5 1 

#################INPUT################################################################################################################################

#1, bulk_gene_list.txt is the file that maps the bulk expression data gene set to gene symbol IDs. Format: 'gene symbol ID\tbulk gene ID'. Set it to None if you do not have it.

#2, sc_gene_list.txt is the file that maps the sc expression data gene set to gene symbol IDs. Format: 'gene symbol ID\tsc gene ID'. Set it to None if you do not have it.

#3, gene_pair_list is the file that contains gene pairs and their labels. Format: 'GeneA GeneB 0'.

#4, data_separation_index_list is a number list that divides gene_pair_list into small parts.

#Here we use the data separation index list to divide the gene pairs into small data parts and to make sure that the gene pairs in each index interval are completely isolated from the others, so that CNNC's performance can be evaluated on just a small data part.

#If users do not want to separate the data, they can simply generate an index list that divides the data into N equal parts.

#5, bulk_expression_data should be in HDF5 format. Users can use their own data or the data we provide. Set it to None if you do not have it.

#6, sc_expression_data should be in HDF5 format. Users can use their own data or the data we provide. Set it to None if you do not have it.

#7, flag: 0 means do not generate a label list; 1 means generate a label list.

If users do not have bulk (single-cell) data, just pass None for bulk_gene_list.txt (sc_gene_list.txt) and bulk_expression_data (sc_expression_data).

#################OUTPUT

It generates a NEPDF_data folder and a series of data files containing Nxdata_tf (NEPDF file), ydata_tf (label file), and zdata_tf (gene symbol pair file) for each divided data part.
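Before training, it can help to check the label balance of the generated parts. A minimal sketch; the file name 'ydata_tf0.npy' is an assumption about the naming convention, so confirm it against your NEPDF_data folder:

import numpy as np

labels = np.load('NEPDF_data/ydata_tf0.npy', allow_pickle=True).astype(int)
values, counts = np.unique(labels, return_counts=True)
print(dict(zip(values.tolist(), counts.tolist())))  # e.g. {0: ..., 1: ..., 2: ...}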

8.3 Step 3: use train_with_labels_three_foldx.py to train a new model with three-fold cross-validation.

Usage: python train_with_labels_three_foldx.py number_of_data_parts_divided NEPDF_path number_of_categories

Command line on the authors' Linux machine:

module load cuda-8.0 (this is to enable the GPU)
srun -p gpu --gres=gpu:1 -c 2 --mem=20Gb python train_with_labels_three_foldx.py 9 /home/yey/CNNC-master/NEPDF_data 3 > results.txt

#######################OUTPUT

It generates three cross-validation folders whose names begin with YYYYY; in each one, keras_cnn_trained_model_shallow.h5 is the trained model.
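To reuse a model trained this way, something along the following lines should work. Whether load_model can restore the file directly depends on how it was saved (if the .h5 only holds weights, rebuild the architecture as predict_no_y.py does and call model.load_weights instead); the folder and file names below are placeholders:

import numpy as np
from keras.models import load_model

model = load_model('YYYYY_x/keras_cnn_trained_model_shallow.h5')  # placeholder folder name
nepdf = np.load('NEPDF_data/Nxdata_tf0.npy', allow_pickle=True)
# assuming the saved array is 3-D; the network expects a trailing channel axis,
# e.g. (n, 32, 32, 1) or (n, 64, 32, 1)
probs = model.predict(nepdf.reshape(nepdf.shape + (1,)))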

9, Notes

Please read this README carefully and make sure that all file paths are correct, since users' computing environments differ.

All command lines above are for the demo only. If you want to run the real data, please replace "mmukegg_new_new_unique_rand_labelx_num_sy.txt" with "mmukegg_new_new_unique_rand_labelx_num.txt", and replace all "9" values with 3057, the real number of separations.

When the label list is very large (say, more than 100,000 gene pairs), we recommend feeding a series of small number_of_data_parts_divided values so that the NEPDF generation can run in parallel.

We are exploring new tasks and new datatypes for CNNC, to be continued...

Enjoy our CNNC!!


cnnc's Issues

Confusion about how the KEGG edges were filtered

Hi!

In the paper, you mentioned:

KEGG contains 290 pathways, and Reactome contains 1,581
pathways. For both, we only select directed edges with either activation or
inhibition edge types and filter out cyclic gene pairs where genes regulate
each other mutually (to allow for a unique label for each pair). In total, we
have 3,057 proteins with outgoing directed edges in KEGG, and the total
number of directed edges is 33,127. For Reactome, the corresponding
numbers are 2,519 and 33,641.

What I am not clear about is whether you removed all of the cyclic gene pairs (A->B and B->A) or kept just one direction. For example, if you have both A->B and B->A, did you keep A->B but not B->A, or did you remove both of them?

Also, if instead of predicting regulatory pairs, I only want to predict the pairs in the same pathway no matter their causality, then can I keep other KEGG edge types that are not activation or inhibition?

Thanks in advance!!

What's the meaning of "data separation index list" file?

Hello!
I want to predict some TF-target gene pairs with CNNC. However, I'm confused about the function of the "data separation index list" file in sections 7.2 and 8.2.
The annotation says this file is a number list that divides gene_pair_list into small parts. There are 92,472 gene pairs in CNNC-master/data/mmukegg_new_new_unique_rand_labelx.txt, which is the gene pair list. However, the mmukegg_new_new_unique_rand_labelx_num_sy.txt file only contains 10 numbers (screenshot omitted).
I don't know how this file splits the gene pair file. And how should I create this "data separation index list" for my own gene pair list?

ChIP-seq peak p-value information

Hi!

I'm very interested in using your CNNC algorithm on one of my PhD projects - it is great! I would like to test a couple of things before doing so, and therefore I was wondering if you have some intermediate files that report the p-value associated with each peak for all TFs in the bone marrow-derived macrophage, dendritic cell, and mESC ChIP-seqs. I guess an intermediate file like this was used to build all the gene-pair files.

Could you provide the files with this information? Something like TF - Gene - p-value would be great, but a standard file containing the TF name, the start and end coordinates of the peak, and the associated p-value would also work.

Thank you very much in advance,
Ines

clarification on getting started with CNN for bulk data and example gene-gene interactions

Hello,

I read your recent paper in PNAS with great interest and am really excited by the elegant way in which it is able to detect gene interaction networks. I would like to apply this to my data and was wondering if you could provide some clarification on a few questions I have, thanks!

I have built a standard classification NN with Keras that is aimed at understanding bulk RNA-seq data. The samples all carry information such as sex, age, and organ, so I have an expression matrix with samples as columns and genes as rows:

sample 1 will have its own organ, sex, and age labels, etc. This is therefore a multi-class, multi-label classification task (binary_crossentropy with sigmoid activation at the final layer).

I read your paper with great interest: I would like to combine this network with the CNN you have created so that I could identify gene-gene interactions between different sexes, ages, organs, etc., basically for the different classes.

One thing that is unclear to me from the source code is where you obtain your gene-gene interaction data, and where this is fed into the network as the reference interaction data. Are the y_train labels still the classes in my data, such as organ, age, etc.?

Is it possible for this network to be joined to the sequential model at some concatenated layer between the two networks?

sorry for the vague question, but I am just trying to understand the potential application here..

thanks!!

unable to achieve the accuracy from paper

Hi there,
I'm interested in your CNNC model and have read your PNAS paper,
but I have met some problems reproducing the results from the paper.

I ran the demo literally following the readme, but the result is not good.

At about 20 epochs the model starts to overfit; during the remaining 180 epochs the validation loss keeps increasing while the validation accuracy decreases (the result graph looks just like the ones posted in other issues). The peak validation accuracy before overfitting is about 48 to 52% (for my 3 tries with learning rates 0.01, 0.003, 0.001), which is far from what the paper describes.

It seems that there is something wrong with the current code. Is the demo command in readme.md up to date? Does the training code fit the demo? How did you divide the test set from the raw data and get the test accuracy? From the current predict_no_y.py it seems that it is reading the training set to do the prediction job on a very overfit model.

Would you please check the code or upload the latest version? Thanks very much! Looking forward to your reply and your next paper on GCNG!

Question about mesc_sc_gene_list.txt

Hi,

I'm interested in rerunning the mESC data analysis. I noticed that the format of the mesc_sc_gene_list.txt file is different from the sc_gene_list.txt file. Am I misunderstanding the mesc_sc_gene_list.txt file? I believe the first column is the gene name and the second should be a number.

Thanks,
Jean

Running predict_no_y.py with bulk data only

Hi, so I have run the example data on my machine with both bulk and single-cell data and that works fine. However, when I attempted to run my own data, which is bulk only, the predict_no_y.py script failed with the error:

Traceback (most recent call last):
  File "predict_no_y.py", line 96, in <module>
    model.load_weights(model_path)
  File "D:\CNNC\venv\lib\site-packages\keras\engine\saving.py", line 492, in load_wrapper
    return load_function(*args, **kwargs)
  File "D:\CNNC\venv\lib\site-packages\keras\engine\network.py", line 1230, in load_weights
    f, self.layers, reshape=reshape)
  File "D:\CNNC\venv\lib\site-packages\keras\engine\saving.py", line 1237, in load_weights_from_hdf5_group
    K.batch_set_value(weight_value_tuples)
  File "D:\CNNC\venv\lib\site-packages\keras\backend\tensorflow_backend.py", line 2960, in batch_set_value
    tf_keras_backend.batch_set_value(tuples)
  File "D:\CNNC\venv\lib\site-packages\tensorflow_core\python\keras\backend.py", line 3323, in batch_set_value
    x.assign(np.asarray(value, dtype=dtype(x)))
  File "D:\CNNC\venv\lib\site-packages\tensorflow_core\python\ops\resource_variable_ops.py", line 819, in assign
    self._shape.assert_is_compatible_with(value_tensor.shape)
  File "D:\CNNC\venv\lib\site-packages\tensorflow_core\python\framework\tensor_shape.py", line 1110, in assert_is_compatible_with
    raise ValueError("Shapes %s and %s are incompatible" % (self, other))
ValueError: Shapes (512, 512) and (1536, 512) are incompatible

As far as I could tell the only difference between my files and the ones from the example was the lack of single cell data. So I tried running the example data without single cell and got the same error. So the dimensions of the NEPDF objects being (N, 32, 32) instead of (N, 64, 32) is clearly the issue, but I cannot figure out how to adjust the predict_no_y.py script. If you could explain it would be greatly appreciated, thanks!

Conda environment

Hi, seeing as you recommend using conda to manage the environment would it be possible to share your environment.yml file or the output from conda list as I am having some trouble finding the right combination of packages.

Thanks

histogram2d output transformation

Hi, thanks for the great study. I just have a quick question about the transformation in get_xy_label_data_cnn_combine_from_database.py:

HT = (log10(H / len(x_tf) + 10 ** -4) + 4) / 4

I guess the log10 and 1e-4 are to get small numbers for numerical stability. What are +4 and /4 for?

Thanks again!

Expression data

I could not find the expression data (h5 files) used for this experiment.
What is the shape of the expression data before it is processed into NEPDF format?
Is it N × 32 (N: number of genes, and 32 samples/conditions for each gene), or do you compress the information from a much larger feature set down to just 32 values per gene?

Troubles when using pretrained KEGG model

Hi, I am sorry to trouble you with my problems.
I downloaded the whole project and tried to test your KEGG pre-trained weights with only the mouse bulk dataset, but there was an error about the shape of the weight matrix.
I added "model.load_weights(model_path,by_name=True)" and found the following information: ValueError: Layer #19 (named "dense_1"), weight <tf.Variable 'dense_1/kernel:0' shape=(512, 512) dtype=float32, numpy=
array( ………………) > has shape (512, 512), but the saved weight has shape (1536, 512).
I wonder whether the use of the single-cell dataset is the key to this problem, or maybe your weights do not fit the code here?
I have tried training a model with only the bulk dataset and it was able to predict. Well...
It would be greatly appreciated if you could tell me more, thanks.

Question about data files

Hi!

I'm interested in training a variant of CNNC and am starting with reproducing the experiment to predict TF-gene interactions. My understanding is that I have to train on the data in the files data/bone_marrow_gene_pairs_200.txt, data/dendritic_gene_pairs_200.txt, and data/mesc_gene_pairs_400.txt. However, I'm confused about why the labels come from three categories instead of two. I thought this was a binary prediction task? Are the labels actually placeholder labels?

Thanks!

Error while training a new model

Hi,
I'm trying to follow your documentation on predicting edge scores on your example data. I ran into the following issues:

  • I get the following error when I ran the command KERAS_BACKEND=theano python predict_no_y.py 9 NEPDF_data/ trained_models/KEGG_keras_cnn_trained_model_shallow.h5

The output is:
Using Theano backend.
select <class 'type'>
(309, 64, 32, 1) x_test samples
Traceback (most recent call last):
File "predict_no_y.py", line 81, in
model.load_weights(model_path)
File "anaconda3/envs/cnnc/lib/python3.7/site-packages/keras/engine/saving.py", line 458, in load_wrapper
return load_function(*args, **kwargs)
File "anaconda3/envs/cnnc/lib/python3.7/site-packages/keras/engine/network.py", line 1208, in load_weights
with h5py.File(filepath, mode='r') as f:
File "anaconda3/envs/cnnc/lib/python3.7/site-packages/h5py/_hl/files.py", line 394, in init
swmr=swmr)
File "anaconda3/envs/cnnc/lib/python3.7/site-packages/h5py/_hl/files.py", line 170, in make_fid
fid = h5f.open(name, flags, fapl=fapl)
File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
File "h5py/h5f.pyx", line 85, in h5py.h5f.open
OSError: Unable to open file (truncated file: eof = 8650752, sblock->base_addr = 0, stored_eof = 8661656)

I'm using Linux OS and installed the packages using Anaconda. The h5 files are obtained after extracting the .rar file containing the models. Could you provide the original .hd5 files without using .rar compression?

  • So I tried training my own model using KERAS_BACKEND=theano python train_new_model/train_with_labels_wholedatax.py 9 NEPDF_data/ 3. Then I got another error:

Using Theano backend.
0 12
1 12
2 144
3 48
4 15
5 18
6 3
7 12
8 45
Traceback (most recent call last):
File "train_new_model/train_with_labels_wholedatax.py", line 63, in
(x_train, y_train,count_set_train) = load_data_TF2(whole_data_TF,data_path)
File "train_new_model/train_with_labels_wholedatax.py", line 51, in load_data_TF2
yydata_x = yydata_array.astype('int')
ValueError: invalid literal for int() with base 10: 'olfr1136\tgnal'

I'm not sure how to fix this.

  • Also, the usage for the scripts get_xy_label_data_cnn_combine_from_database.py and predict_no_y.py differs between the documentation and the actual scripts. Which is the correct command?

I'll appreciate it if you can help me with this/update your documentation accordingly.

Thank you,
-Aditya

The labels of result array?

Hello,
I got the result from predict_no_y.py, and the result .npy file was an Nx3 array (the three labels 1, 2, 0 in the trained model). I have two questions:
(screenshot omitted)

  1. I don't know the label of each column. Are the column labels the same as in the previously trained model (screenshot omitted), or just in numeric order like [0, 1, 2] or [2, 1, 0]?

  2. Most of the results were approximately equal to [0.33, 0.33, 0.33], which seems to carry no meaning for the prediction. I want to know whether this is normal, or what I can do to improve the prediction?

I just want to build a TF-gene network for downstream analysis and know little about machine learning, but your CNNC model really attracted me! Thank you for your consideration!

Unable to replicate results from the paper

Hi,
I'm trying to reproduce the results from your PNAS paper on CNNC. Specifically, I'm trying to reproduce the leave-one-TF-out results presented in Fig. 2 for GTRD TF-target prediction on the macrophage dataset. However, my validation accuracy and loss do not seem to improve despite running with the recommended parameters from the supplement for over 200 epochs. Moreover, when I test on the held-out TF, all the predicted output values are nearly identical and result in close to random-predictor performance.

Here's the steps I followed for the sake of reproducibility:

  1. I first generated the NEPDFs using the following command: python get_xy_label_data_cnn_combine_from_database.py None data/sc_gene_list.txt data/bone_marrow_gene_pairs_200.txt data/bone_marrow_gene_pairs_200_num.txt None bone_marrow_cell.h5 1.

  2. I then used this command to train a new model: KERAS_BACKEND=theano python train_new_model/train_with_labels_wholedatax.py 12 NEPDF_data/ 2

  3. Here's the output at the end of training (screenshot omitted). Note that the validation accuracy never improves beyond 0.45, and neither does the training accuracy, which is close to that of a random predictor.

  4. I then tested it on held-out NEPDF data using: python predict_no_y.py 1 NEPDF_data/ 2 xwhole_saved_models_T_32-32-64-64-128-128-512_e200/keras_cnn_trained_model_shallow.h5. The output has all values set to 1 (i.e., high confidence for all the edges); the output in predict_results_no_y_1/y_predict.npy is just an array filled with identical values.

I'm not sure what I'm missing here. Did you observe similar training and validation plots in your end_result.pdf at the end of training on this dataset? I'm seeing similar behavior for the mESC and dendritic cell datasets, and on my own scRNA-seq datasets. I also attached the full log I obtained while training (attachment omitted).
Could you please help me with this?

Thanks in advance,
Aditya

Ability to use multiple scRNAseq datasets

Hi, is there a way to use many (10+) scRNA-seq datasets as inputs? I realize one could simply try to integrate them into one dataset, but I wanted to check.
Thanks!

Can’t generate NEPDF data for the example sc-RNA seq data

Hello!
I have a new problem: get_xy_label_data_cnn_combine_from_database.py reports an error when generating NEPDF data for the example scRNA-seq h5 file:
python CNNC/get_xy_label_data_cnn_combine_from_database.py None CNNC/data/sc_gene_list.txt CNNC/data/dendritic_gene_pairs_200.txt CNNC/data/dendritic_gene_pairs_200_num.txt None CNNC/Data_sources/dendritic_cell.h5 1. The dendritic_cell.h5 was downloaded from the link above (screenshot omitted).
Then it reported errors (screenshot omitted).
So I opened the h5 file with h5py and changed the key from "RPKMs" to "rpkm". However, it still reported errors (screenshot omitted).
It works well when I run this command for the bulk RNA-seq data "mouse_bulk.h5", so I wonder whether the scRNA-seq data should be different from the bulk data, or if there is some other reason?
Thanks a lot!

confusion about the CNNC model

Hello!
Your work on CNNC is magnificent, but I am an amateur at deep learning, so I am confused about the input files for CNNC. I performed scRNA-seq on my samples and have several genes whose relationships I want to check, so I have the expression profile as input, but what is the gene-pair candidate list? Is it a gene pair list with known relationships learned from previous studies?

PRE-TRAINED MODELS

The rar file for the pretrained models is not opening; it says no archives found. Where can I get the pretrained models?

How to test for causal interactions

Hi,
I am very interested in the application of CNNC to predict the direction of causal edges between genes as reported in your PNAS paper. From the readme it is not clear to me how the input needs to be changed to get causal output from the network. My understanding is that the trained models provided are for finding undirected interactions and separate networks would have to be trained to find directed interactions, but I am not sure how to do this.
Thanks!

Confusion regarding inputs

Hello!

I am trying to use this to build a gene regulatory network based on my single-cell data. However, I am slightly confused by your inputs. In #3 in your README, I have to input gene_pair_list with the genes and the labels. What if I don't know the labels, as that is what I'm trying to find out? If we already input the labels into the model, then what is this used for? Let me know. Thanks!!!

Data source of rank_total_gene_rpkm.h5

Hi!

I'm interested in CNNC, and I have read your article and run the example script.

Here I have some questions.


Firstly, I noticed in your article that your scRNA-seq dataset is from [13] "A web server for comparative analysis of single-cell RNA-seq data", Nat. Commun. 9, 4768 (2018).

So from this reference article I visited their website https://scquery.cs.cmu.edu/ and downloaded the dataset https://s3.us-east-2.amazonaws.com/sc-query/processed_data/expr_data.hdf5

but this dataset does not look the same as the provided rank_total_gene_rpkm.h5. Did I get something wrong, or did you reprocess the scQuery dataset? Also, could you please specify the source or reprocessing procedure of the other provided datasets? Many thanks!


Secondly, I ran the training script following readme.md and met some problems.

  1. It took much more time to train the model than I expected. I remember you mentioned in the article's supplementary table that it needs about 6 hours to run the whole KEGG expression data on a 1080 Ti,
    while I ran it on a Tesla K40m (12 GB memory) and it took about 30 hours to complete the 200 epochs. During the run I checked the GPU and memory usage: the GPU runs at full capacity (~98%), but the process uses only ~120 MB of GPU memory and ~3 GB of RAM while occupying over 100 GB of virtual memory (does that mean most of its data is stored on the HDD, which makes it run so slowly?).

  2. After 200 epochs here is the result (screenshot omitted):

From the end_result.pdf, the training accuracy starts from 0.4 and reaches 0.65 after 20 epochs, but the validation accuracy stays stuck below 0.5 and the validation loss even keeps increasing over the 200 epochs while the training loss decreases.

Could you please figure out what I got wrong?
All data is as provided by readme.md; my commands are like this:

python3 get_xy_label_data_cnn_combine_from_database.py
    bulk_gene_list.txt
    sc_gene_list.txt
    mmukegg_new_new_unique_rand_labelx.txt
    mmukegg_new_new_unique_rand_labelx_num.txt
    mouse_bulk.h5
    rank_total_gene_rpkm.h5
    1
python3 train_with_labels_wholedatax.py
    3057
    NEPDF_data/
    3

I'm very new to machine learning, so please forgive me if I made any silly mistakes or asked a silly question :)

Validation set query

Hi,

Thanks for the code and the great paper.

In your paper you state that for TF-gene prediction you did three-fold CV, with the 38 TFs divided into three roughly equal folds (13, 13, 12). It seems that in the code the validation-set TFs are not separated out from the training set; instead, a mixture of TFs is selected from the training set. Is this how you ran the code for the results in the paper? Just checking, as I am trying to replicate your results before running on some of my own data!
Thanks
