
Comments (11)

jasongfleischer commented on August 17, 2024

Hi @reese3928 and @wangmhan

I am about 3/4 of the way through refactoring this code and making it a bit easier to use for people who want to

  1. take a trained classifier and try to use it on their data
    or
  2. train a classifier on their own data and see how it predicts in a train-test or cross-validation scheme

I think many of the changes I am making will address your questions in this thread. I hope to be able to commit the new files by Monday or Tuesday.

As a general note of caution, I will always expect (1) to do much worse than (2) due to batch effects, but I will try to make it as easy as possible for those who want to do (1).

To answer some questions:
@reese3928 asked what happens when new data has gene names that do not match the genes of the best model pkl file. Right now you can't use the saved best model; you have to retrain from scratch on your data. Once I check in the new code you will at least be able to try option (1).

Xu, I think there are a few things you're not understanding in general about how this works. You mention subsetting the genes on your own... in fact the algorithm does this for you, selecting a gene subset based on the expression in your dataset. We can talk about this in more detail over email if you'd like.

Right now it seems like people are having problems loading their own data... and I would expect this. I didn't write a generic load data function, I wrote one that loaded my particular data. Even with the new code that I am writing this will still be true, but I will try to make it at least easier for people to do this than it is now.

@wangmhan If you retrain on a new dataset that is in TPM or raw counts or RPKM or anything it should work just fine. I have trained on these kinds of data and seen pretty much the same performance as in FPKM.
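For anyone who wants to compare units before retraining: FPKM (or RPKM) values can be rescaled to TPM one sample at a time, because a gene's TPM equals its FPKM divided by that sample's FPKM total, times one million. Below is a minimal sketch, assuming a pandas DataFrame with samples in rows and genes in columns. The fpkm_to_tpm helper and the toy values are made up for illustration and are not part of the repository.

```python
import pandas as pd

def fpkm_to_tpm(fpkm: pd.DataFrame) -> pd.DataFrame:
    """Rescale each sample (row) so that its values sum to one million."""
    totals = fpkm.sum(axis=1)              # per-sample FPKM total
    return fpkm.div(totals, axis=0) * 1e6  # divide each row by its own total

# Toy data (hypothetical): 2 samples x 3 genes
fpkm = pd.DataFrame(
    {"GENE_A": [10.0, 5.0], "GENE_B": [30.0, 5.0], "GENE_C": [60.0, 10.0]},
    index=["sample_1", "sample_2"],
)
tpm = fpkm_to_tpm(fpkm)
```

The identity only holds when the total is taken over the full gene set of the sample, so the rescaling should happen before any genes are dropped.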

As to the idea of running this on a server for you, I will look into configuring Binder to launch the notebook, or creating a web app. But that will definitely be a bit down the road. In the meantime, the new version of the code I will be uploading will make it easier for you to do stuff, but you will still have to supply your own Python/Jupyter environment. I suggest using Anaconda as the easiest way to do this.

If there are other questions you'd like addressed, please start a new thread so I can answer them one at a time. This thread got a bit crazy with lots of stuff going back and forth, and I'm sure that I've missed at least a few points I should address.

from predicting-age-from-the-transcriptome-of-human-dermal-fibroblasts.

jasongfleischer commented on August 17, 2024

Thanks for your interest. I am actively working on an updated version of the code which is more suitable for running on new data, as you suggest.

However in general the classes subset_genes_XXX will take any data you wish to feed them in the format where genes are in columns and samples are the rows. So if you wish to load up E-MTAB-3037 or any other dataset you can process them directly by doing the following...

    from sklearn.model_selection import LeaveOneOut

    # genes: pandas DataFrame of gene expression values, rows = samples, columns = genes
    # ages: np.array with one entry per sample, in the same order as the rows of genes
    #       (for a pandas Series, index it with ages.iloc[...] in the loop instead)
    clf = subset_genes_XXX(...)  # parameters as you wish
    plot_cval = LeaveOneOut()    # or any other cross validator you wish

    true_age = []
    pred_age = []
    for train, test in plot_cval.split(genes):
        clf.fit(genes.iloc[train, :], ages[train])
        pred_age.append(clf.predict(genes.iloc[test, :]))
        true_age.append(ages[test])
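The same loop can be tried end to end without the repository's classes. The sketch below is only an illustration under stated assumptions: scikit-learn's Ridge regressor stands in for a subset_genes_XXX instance, and the expression values and ages are synthetic. It also shows one way to flatten the per-fold results into one value per sample.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import LeaveOneOut

rng = np.random.default_rng(0)

# Synthetic stand-in data: 12 samples x 5 genes (not real expression values)
genes = pd.DataFrame(
    rng.normal(size=(12, 5)),
    columns=[f"gene_{i}" for i in range(5)],
)
ages = np.linspace(20, 75, num=12)  # one age per sample, same order as rows

clf = Ridge(alpha=1.0)  # stand-in for a subset_genes_XXX instance
cval = LeaveOneOut()    # any scikit-learn cross validator works here

true_age, pred_age = [], []
for train, test in cval.split(genes):
    clf.fit(genes.iloc[train, :], ages[train])
    pred_age.append(clf.predict(genes.iloc[test, :]))
    true_age.append(ages[test])

# Each list entry is a length-1 array; flatten to one value per sample
true_age = np.concatenate(true_age)
pred_age = np.concatenate(pred_age)
mean_abs_error = np.mean(np.abs(pred_age - true_age))
```

With random inputs the error itself is meaningless; the point is only the shape of the loop and of its outputs.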

Thanks for your interest in this work. My apologies for the delay in replying; my notifications were set incorrectly on GitHub.


reese3928 commented on August 17, 2024

Hi,

I have a follow-up question about using this pipeline on another dataset. If we would like to train the ensemble LDA on the 133 healthy individuals and test it on another dataset, then if I understand correctly we could do the following (please correct me if I'm wrong):

    # assuming all the variables/objects are appropriately loaded
    clf = load('fig2_bestmodel_Ensemble LDA.pkl')
    clf.fit(genes[normals], ages)
    newdat_predict = clf.predict(newdat)

In genes[normals], there are 27,142 genes. If some of these 27,142 genes are not included in newdat (e.g. newdat contains 25,000 genes, which are a subset of the 27,142), how can we deal with this? So far, I can think of two ways:

1) If we use the pre-trained model, we could do:

    clf = load('fig2_bestmodel_Ensemble LDA.pkl')
    # keep only the 25,000 genes present in newdat (assumes gene names are the column labels)
    temp = genes[normals][newdat.columns]
    clf.fit(temp, ages)
    newdat_predict = clf.predict(newdat)

2) We could also subset genes[normals] from the very beginning and run cell 275 in the notebook from scratch without loading the clf. Basically, just rerun the ensemble LDA and generate a new clf.
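The gene-matching step that both options rely on can be sketched as follows. This is a minimal illustration, assuming both tables are pandas DataFrames with samples in rows and gene names as column labels. The names train_genes and new_genes and the toy values are hypothetical.

```python
import pandas as pd

# Toy tables (hypothetical names and values): rows = samples, columns = genes
train_genes = pd.DataFrame(
    [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]],
    columns=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
)
new_genes = pd.DataFrame(
    [[0.5, 0.7, 0.9]],
    columns=["GENE_D", "GENE_B", "GENE_X"],
)

# Genes present in both tables, kept in the training table's column order
shared = train_genes.columns[train_genes.columns.isin(new_genes.columns)]

# Training genes that the new dataset lacks, useful to report before retraining
missing = train_genes.columns.difference(new_genes.columns)

train_subset = train_genes[shared]
new_subset = new_genes[shared]  # reordered to match the training columns
```

Selecting both tables with the same shared index keeps the column order identical, which matters because a fitted scikit-learn style model expects features in the order it was trained on.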

Which approach is preferred? Is rebuilding the ensemble LDA necessary in this case? I'd appreciate it if you could help. Thanks very much!

Sincerely,
Xu Ren


wangmhan commented on August 17, 2024

Hi, thank you for the reply.
I tried to fit my own data using the script from the load_data part, and I have some comments about it:

  1. I think the input data format is also an important part. The expression matrix is fairly standard, with genes as rows and expression values as columns. Have you tested RPKM and TPM, and are the results similar? Also, are the first several columns of gene information (chromosome, length, etc.) needed or not?
  2. The information matrix is a bit complicated. Is only some of the information, such as age, used in the downstream analysis?
    So maybe the new version could define a standard format for both of these matrices :)
  3. I have problems with lc_sizes, because my number of samples is not the same as yours. My age range is also different, and I have not yet found which line I should change for that.
  4. I'm new to Python, so I don't know how to run the script on a server. I ran it on my own Mac notebook, and even the linear regression model took more than 24 hours to produce a result...
    So it would be nice if you could also consider this issue when updating the version.
  5. I think the output could include both the figure and a matrix of predicted ages. I guess that is what pred_age.append collects, as you mentioned?
    I hope the new version is coming soon, so that we can easily apply this method to other datasets as well! Thanks a lot!


reese3928 commented on August 17, 2024

Hi @wangmhan ,

  1. I used FPKM, and I think the first few columns of gene information (chromosome, length, ...) are not required. In @jasongfleischer's notebook, these columns are dropped before fitting the model. But as @jasongfleischer mentioned in another post, there could be batch effects.

  2. I think when we are predicting age, only age is used.

  3. I used the author's default setting.

  4. To run a Python script on a server, we can run python scriptname.py & so that the script runs in the background.

  5. I think in the author's notebook, pred_age and true_age are numpy arrays.
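For point 5, one way to get a table of predicted ages alongside the figure is to flatten the per-fold results and put them in a DataFrame. This is only a sketch: the values below are invented, and the lists are assumed to hold one length-1 array per held-out sample, as a leave-one-out loop would produce.

```python
import numpy as np
import pandas as pd

# Hypothetical per-fold outputs: one length-1 array per held-out sample
true_age = [np.array([25.0]), np.array([40.0]), np.array([63.0])]
pred_age = [np.array([28.5]), np.array([37.0]), np.array([60.0])]

results = pd.DataFrame(
    {
        "true_age": np.concatenate(true_age),
        "pred_age": np.concatenate(pred_age),
    }
)
results["abs_error"] = (results["pred_age"] - results["true_age"]).abs()

# With no path given, to_csv returns the CSV text; pass a file name to write to disk
csv_text = results.to_csv(index=False)
```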

Please correct me if I'm wrong. Thanks.


wangmhan commented on August 17, 2024


reese3928 commented on August 17, 2024

Hi @wangmhan ,

Based on my understanding, lc_sizes is used in the training data set. If we use the author's original training data, and then predict on another test data, I think lc_sizes should be fine. I don't quite understand why it doesn't work. Regarding point 4, I'm not sure really.


wangmhan commented on August 17, 2024


wangmhan commented on August 17, 2024


jasongfleischer commented on August 17, 2024

OK, @wangmhan and @reese3928, the new version is checked in. Note that I am using git lfs to host the large binary data files of the trained ensembles, so you will have to install it per the instructions before getting the new files.

Let me know if you have any trouble with the new notebooks or with git lfs.

--Jason


jasongfleischer commented on August 17, 2024

OK all, I pushed a bad update earlier because I misused git lfs. My apologies. The notebooks and datasets for using the pre-trained classifiers have been updated again. Let me know if you have any problems.

