Could this pipeline be used for other datasets? (predicting-age-from-the-transcriptome-of-human-dermal-fibroblasts, issue open)

wangmhan commented on August 17, 2024

Could this pipeline be used for other datasets?

from predicting-age-from-the-transcriptome-of-human-dermal-fibroblasts.

Comments (11)

jasongfleischer commented on August 17, 2024

Hi @reese3928 and @wangmhan

I am about 3/4 of the way through refactoring this code and making it a bit easier to use for people who want to

1. take a trained classifier and try to use it on their data, or
2. train a classifier on their own data and see how it predicts in a train-test or cross-validation scheme.

I think many of the changes I am making will address your questions in this thread. I hope to be able to commit the new files by Monday or Tuesday.

As a general note of caution, I will always expect (1) to do much worse than (2) due to batch effects, but I will try to make it as easy as possible for those who want to do (1).

To answer some questions:
@reese3928 asked what happens when new data has gene names that do not match the genes of the best model pkl file. Right now you can't use the saved best model; you have to retrain from scratch on your data. Once I check in the new code you will at least be able to try option (1).
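Before retraining, it can help to see how far the gene names of the two datasets actually diverge. The following is a minimal sketch, not code from the repository: it assumes both tables are pandas DataFrames with samples in rows and genes in columns, and the function name `gene_overlap` and the toy gene names are hypothetical.

```python
import pandas as pd


def gene_overlap(train_genes: pd.DataFrame, new_genes: pd.DataFrame) -> dict:
    """Compare the gene columns of two samples-by-genes tables."""
    train_cols = pd.Index(train_genes.columns)
    new_cols = pd.Index(new_genes.columns)
    return {
        "shared": list(train_cols.intersection(new_cols)),
        "missing_in_new": list(train_cols.difference(new_cols)),
        "extra_in_new": list(new_cols.difference(train_cols)),
    }


# Toy data; the gene names are illustrative only.
train = pd.DataFrame({"GENE_A": [1.0, 2.0], "GENE_B": [0.5, 0.1], "GENE_C": [3.0, 4.0]})
new = pd.DataFrame({"GENE_B": [0.2], "GENE_C": [2.5], "GENE_D": [7.0]})

report = gene_overlap(train, new)
print(report)

# Restrict both tables to the shared genes before retraining from scratch.
train_shared = train[report["shared"]]
new_shared = new[report["shared"]]
```

Restricting both tables to the shared columns only makes them comparable; it does not make a previously saved model usable on the new data.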

Xu, I think there are a few things you're not understanding in general about how this works. You mention subsetting the genes on your own, but in fact the algorithm does this for you, selecting a gene subset based on the expression in your dataset. We can talk about this in more detail over email if you'd like.

Right now it seems like people are having problems loading their own data, and I would expect this. I didn't write a generic load data function; I wrote one that loaded my particular data. Even with the new code that I am writing this will still be true, but I will try to make it at least easier for people to do this than it is now.
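Since there is no generic loader, the reshaping has to be done by hand for each dataset. As a rough sketch of what that can look like with pandas, assuming a genes-by-samples table that carries a few annotation columns plus a separate sample sheet with ages (every column name and value below is hypothetical):

```python
import pandas as pd

# Toy genes-by-samples table with annotation columns (all names hypothetical).
raw = pd.DataFrame({
    "gene_id": ["GENE_A", "GENE_B", "GENE_C"],
    "chromosome": ["1", "2", "X"],
    "length": [1200, 800, 1500],
    "sample_1": [10.0, 0.0, 3.5],
    "sample_2": [12.0, 1.0, 2.5],
})
# Toy sample sheet, deliberately listed in a different order than the columns above.
meta = pd.DataFrame({"sample": ["sample_2", "sample_1"], "age": [61, 34]})

# Drop the annotation columns, then transpose so rows = samples, columns = genes.
annotation_cols = ["chromosome", "length"]
genes = raw.drop(columns=annotation_cols).set_index("gene_id").T

# Align the ages to the row order of the expression table.
ages = meta.set_index("sample")["age"].reindex(genes.index).to_numpy()

print(genes)
print(ages)
```

The explicit `reindex` step guards against the two files listing samples in different orders, which would otherwise silently pair expression profiles with the wrong ages.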

@wangmhan If you retrain on a new dataset that is in TPM or raw counts or RPKM or anything it should work just fine. I have trained on these kinds of data and seen pretty much the same performance as in FPKM.
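For readers unsure how these units relate: a sample's TPM values are its FPKM values rescaled so that they sum to one million. No conversion is required for retraining, as noted above; the sketch below only illustrates the relationship, assuming a samples-by-genes pandas DataFrame. The helper name `fpkm_to_tpm` and the toy values are hypothetical.

```python
import pandas as pd


def fpkm_to_tpm(fpkm: pd.DataFrame) -> pd.DataFrame:
    """Rescale each sample (row) so that its values sum to one million."""
    return fpkm.div(fpkm.sum(axis=1), axis=0) * 1e6


# Toy samples-by-genes FPKM table; the values are illustrative only.
fpkm = pd.DataFrame(
    {"GENE_A": [10.0, 5.0], "GENE_B": [30.0, 5.0], "GENE_C": [60.0, 10.0]},
    index=["sample_1", "sample_2"],
)

tpm = fpkm_to_tpm(fpkm)
print(tpm)
```

The result depends on the gene set the sum is taken over, so TPM values computed after dropping genes will differ from those computed on the full table.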

As to the idea of running this on a server for you, I will look into figuring out how to configure Binder to launch the notebook or creating a web app. But that will definitely be a bit down the road. In the meantime the new version of the code I will be uploading will make it easier for you to do stuff, but you will still have to supply your own Python/Jupyter environment. I suggest using Anaconda as the easiest possible way to do this.

If there are other questions you'd like addressed, please start a new thread so I can answer them one at a time. This thread got a bit crazy with lots of stuff going back and forth and I'm sure that I've missed at least a few points I should address.


jasongfleischer commented on August 17, 2024

Thanks for your interest. I am actively working on an updated version of the code which is more suitable for running on new data, as you suggest.

However, in general the classes subset_genes_XXX will take any data you wish to feed them, in the format where genes are the columns and samples are the rows. So if you wish to load up E-MTAB-3037 or any other dataset, you can process it directly by doing the following...

    import sklearn.model_selection

    # subset_genes_XXX is a placeholder for one of the subset_genes_* classes
    clf = subset_genes_XXX(...)  # replace ... with parameters as you wish
    # genes: pandas DataFrame of gene expression values, rows = samples, columns = genes
    # ages: np.array with one entry per sample, in the same order as the rows of genes
    #       (for a pandas Series, index with ages.iloc[train] and ages.iloc[test] instead)
    plot_cval = sklearn.model_selection.LeaveOneOut()  # or any other cross-validator you wish

    true_age = []
    pred_age = []
    for train, test in plot_cval.split(genes):
        clf.fit(genes.iloc[train, :], ages[train])
        pred_age.append(clf.predict(genes.iloc[test, :]))
        true_age.append(ages[test])
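To try the loop end to end without the repository's classes, here is a self-contained sketch on synthetic data. It makes two loud assumptions: scikit-learn's `Ridge` regressor is swapped in as a stand-in for `subset_genes_XXX` (it is not the model discussed in this thread), and the expression values and ages are random numbers, so the reported error is meaningless beyond showing the mechanics.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

# Synthetic stand-in data: rows = samples, columns = genes.
rng = np.random.default_rng(0)
n_samples, n_genes = 12, 5
genes = pd.DataFrame(
    rng.normal(size=(n_samples, n_genes)),
    columns=[f"GENE_{i}" for i in range(n_genes)],
)
ages = rng.uniform(20, 80, size=n_samples)  # one age per sample, same order as genes

clf = Ridge(alpha=1.0)  # stand-in estimator, NOT one of the subset_genes_* classes
plot_cval = LeaveOneOut()

true_age = []
pred_age = []
for train, test in plot_cval.split(genes):
    clf.fit(genes.iloc[train, :], ages[train])
    pred_age.append(clf.predict(genes.iloc[test, :]))
    true_age.append(ages[test])

# Each fold contributes a one-element array; flatten them for scoring.
true_age = np.concatenate(true_age)
pred_age = np.concatenate(pred_age)
mae = np.mean(np.abs(true_age - pred_age))
print(f"Leave-one-out mean absolute error: {mae:.1f} years")
```

Any estimator with `fit` and `predict` methods can be dropped in for the stand-in, and any scikit-learn cross-validator can replace `LeaveOneOut`.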

Thanks for your interest in this work. My apologies for the delay in replying; my notifications were set incorrectly on GitHub.


reese3928 commented on August 17, 2024

Hi,

I have a follow-up question about using this pipeline for another dataset. If we would like to train the ensemble LDA on the 133 healthy individuals and test it on another dataset, then, if I understand correctly, we could do the following (please correct me if I'm wrong):

    ## assuming all the variables/objects are appropriately loaded
    clf = load('fig2_bestmodel_Ensemble LDA.pkl')
    clf.fit(genes[normals], ages)
    newdat_predict = clf.predict(newdat)

In genes[normals], there are 27,142 genes. If some of these 27,142 genes are not included in newdat, e.g. newdat contains 25,000 genes that are a subset of the 27,142, how can we deal with this? So far, I can think of two ways:

1) If we use the pre-trained model, we could do:

    clf = load('fig2_bestmodel_Ensemble LDA.pkl')
    temp = ...  # placeholder: subset the 25,000 genes from genes[normals]
    clf.fit(temp, ages)
    newdat_predict = clf.predict(newdat)

2) We can also subset genes[normals] from the very beginning and run cell 275 in the notebook from scratch without loading the clf. Basically, just rerun the ensemble LDA and generate a new clf.

Which approach is preferred? Is rebuilding the ensemble LDA necessary in this case? I'd appreciate it if you could help. Thanks very much!

Sincerely,
Xu Ren


wangmhan commented on August 17, 2024

Hi, thank you for the reply.
I tried to fit my own data using the script from the load_data part, and I have some comments about it:

1. I think the input data format is also an important part. The expression matrix is fairly standard, with genes as rows and expression values as columns. Have you tested RPKM and TPM, and are the results similar? Also, I don't know whether the first several columns of gene information (chromosome, length, etc.) are useful or not.
2. The information matrix is a bit complicated. I wonder whether only some of the information, such as the age, is used in the following analysis. Maybe the new version could have a standard format for both of these matrices :)
3. I have problems with lc_sizes, as my number of samples is not the same as yours. The age range is also different, and I have not yet found which line I should change for that.
4. I'm new to Python, so I don't know how to run the script on a server. I run it on my own Mac laptop, and even the linear regression model takes more than 24 hours to produce a result. It would be nice if you could also consider this issue when updating the code.
5. I think the output could be both a figure and a matrix of predicted ages, which I guess is what pred_age.append collects, as you mentioned?

I hope the new version is coming soon, so that we can easily apply this method to other datasets as well! Thanks a lot!


reese3928 commented on August 17, 2024

Hi @wangmhan,

1. I used FPKM, and I think the first few columns of gene information (chromosome, length, ...) are not required. In the notebook by the author, @jasongfleischer, these columns are dropped from the analysis before fitting the model. But as the author mentioned in another post, there could be batch effects.
2. I think that when we are predicting age, only age is used.
3. I used the author's default setting.
4. To run a Python script on a server, we can do python scriptname.py & to let the script run in the background.
5. I think that in the author's notebook, pred_age and true_age are numpy arrays.
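On the request for a matrix of predicted ages: if the per-fold results are collected in lists, as in the cross-validation loop posted earlier in this thread, they can be flattened into one table. This is a minimal sketch under that assumption; the sample IDs, the numbers, and the output file name are made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-ins for the lists built by a leave-one-out loop: one array per fold.
sample_ids = ["s1", "s2", "s3"]
true_age = [np.array([34.0]), np.array([61.0]), np.array([47.0])]
pred_age = [np.array([38.5]), np.array([55.0]), np.array([49.0])]

# Flatten the per-fold arrays and put them side by side, one row per sample.
results = pd.DataFrame(
    {
        "true_age": np.concatenate(true_age),
        "pred_age": np.concatenate(pred_age),
    },
    index=pd.Index(sample_ids, name="sample"),
)
results["error"] = results["pred_age"] - results["true_age"]
print(results)

# Uncomment to save the table next to the figures (file name is arbitrary).
# results.to_csv("predicted_ages.csv")
```

This relies on the sample IDs being listed in the same order in which the cross-validator visits the samples, which holds for leave-one-out but not for shuffled splits.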

Please correct me if I'm wrong. Thanks.


wangmhan commented on August 17, 2024

Hi Xu,

About point 3: as I am using another dataset, the default setting does not work because the number of samples is different. About point 4: I just wonder whether there is a server version of Jupyter that you can interact with, rather than running the whole workflow and only being able to check the result.


reese3928 commented on August 17, 2024

Hi @wangmhan ,

Based on my understanding, lc_sizes is used in the training data set. If we use the author's original training data, and then predict on another test data, I think lc_sizes should be fine. I don't quite understand why it doesn't work. Regarding point 4, I'm not sure really.


wangmhan commented on August 17, 2024

Hi Xu,

I think you are correct. I don't need to change lc_sizes. Thank you for correcting me.


wangmhan commented on August 17, 2024

Thank you, and I am looking forward to the new version! :)


jasongfleischer commented on August 17, 2024

OK, @wangmhan and @reese3928, the new version is checked in. Note that I am using git lfs to host large binary data files of the trained ensembles, so you will have to install it per the instructions before getting the new files.

Let me know if you have any trouble with the new notebooks or with git lfs.

--Jason


jasongfleischer commented on August 17, 2024

OK all, I pushed a bad update previously. My apologies. I misused git lfs previously. The notebooks and datasets for using the pre-trained classifiers have been updated again. Let me know if you have any problems.

