Comments (2)

Hu-JIN commented on August 19, 2024

Hi Maia,

Thank you for raising the issue. There are several things to consider when running signature discovery on a large number of samples (~6000 in your case). But before I elaborate on those, it's good to make sure that parallel calculation is indeed working on your machine:

  • Make sure there are at least 20 CPUs on the compute node.
  • Compare the compute time for -n 20 -c 1, -n 1 -c 20, and -n 1 -c 1 (with ncpu=1) on a small test example. Make sure that -n 20 -c 1 is indeed the preferred setting on your HPC and that the expected speedup relative to the serial job is observed. This is worth checking because there might be system differences.
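The comparison above can be summarized with a small helper. This is a minimal sketch, not part of MuSiCal: `speedup_report` is a hypothetical name, and the timings in the usage line are made-up placeholders to be replaced with your own measurements.

```python
def speedup_report(t_serial, t_parallel, n_cpus):
    """Return (speedup, efficiency) of a parallel run versus the serial baseline.

    t_serial   -- wall-clock seconds for the serial job (e.g., -n 1 -c 1 with ncpu=1)
    t_parallel -- wall-clock seconds for the parallel job (e.g., -n 20 -c 1)
    n_cpus     -- number of CPUs used by the parallel job
    """
    if t_serial <= 0 or t_parallel <= 0 or n_cpus < 1:
        raise ValueError("timings must be positive and n_cpus must be at least 1")
    speedup = t_serial / t_parallel
    efficiency = speedup / n_cpus  # 1.0 would be perfect linear scaling
    return speedup, efficiency


# Placeholder timings (seconds), for illustration only.
speedup, efficiency = speedup_report(t_serial=600.0, t_parallel=40.0, n_cpus=20)
print(f"speedup: {speedup:.1f}x, efficiency: {efficiency:.0%}")
```

If the efficiency of one setting is far below the other, that is a hint the parallel configuration is not being used as intended on your system.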

Now, things to consider when running signature discovery on a large number of samples.

First, try changing the default parameters to speed up the calculation.
The default parameters in DenovoSig are set quite conservatively to ensure convergence of the optimization algorithm. Empirically, however, we have observed that more aggressive parameters still give fairly good results, so it makes sense to use them when scaling up to a large number of samples (e.g., many thousands). I've just tested the following parameters on ~6300 TCGA WES samples, and the discovered SBS signatures looked to be of high quality to me:

  • min_iter=1000
  • max_iter=10000
  • conv_test_freq=100
  • tol=1e-5
  • n_replicates=20
  • mvnmf_lambda_tilde_grid=np.array([1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0])

These parameters 1) terminate the optimization earlier and 2) use a coarser grid for selecting the best regularization parameter. In my test with ~6300 samples, the run took 2 hr 10 min with 20 CPUs on our HPC, with min_n_components=1 and max_n_components=10. Note that you can always move these parameters back toward the defaults later to see whether the results change in any major way.
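As a rough sketch, the parameters above could be collected in one place and passed to DenovoSig. The parameter names and values are the ones listed in this comment; the helper names (`aggressive_denovo_params`, `run_mvnmf`), the assumption that `musical.DenovoSig` takes the input matrix as its first argument, and the assumption that the run is started with `fit()` are mine, so please check them against your installed version.

```python
def aggressive_denovo_params():
    """Keyword arguments matching the more aggressive settings quoted above."""
    return {
        "min_iter": 1000,
        "max_iter": 10000,
        "conv_test_freq": 100,
        "tol": 1e-5,
        "n_replicates": 20,
        # Coarser grid for the mvNMF regularization parameter.
        "mvnmf_lambda_tilde_grid": [
            1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0,
        ],
    }


def run_mvnmf(X, ncpu=20, min_n_components=1, max_n_components=10):
    """Hypothetical wrapper: run DenovoSig with the settings above.

    Assumes DenovoSig accepts the input matrix X as its first argument
    and that calling fit() starts the run.
    """
    import numpy as np
    import musical  # imported lazily so the parameter helper works without it

    params = aggressive_denovo_params()
    params["mvnmf_lambda_tilde_grid"] = np.array(params["mvnmf_lambda_tilde_grid"])
    model = musical.DenovoSig(
        X,
        method="mvnmf",
        min_n_components=min_n_components,
        max_n_components=max_n_components,
        ncpu=ncpu,
        **params,
    )
    model.fit()
    return model
```

Keeping the settings in one function makes it easy to move individual values back toward the defaults later and compare the results.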

Second, try using NMF results to inform mvNMF runs.
DenovoSig with NMF (method='nmf') is fast: for example, with default parameters, it took only 40 min for the dataset above with ~6300 samples (again with 20 CPUs). This run will provide an estimate of the best n_components (i.e., the number of signatures), or at least a plausible range for it. You can then run DenovoSig again, this time with mvNMF (method='mvnmf'), setting min_n_components and max_n_components to bracket that estimate. This procedure saves computation time by narrowing the search space for n_components.
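The narrowing step can be sketched as follows. `narrow_component_range` and its `margin` argument are hypothetical and only illustrate the idea; how you read the estimated n_components off the fitted NMF model depends on your MuSiCal version, so that part is left as a comment.

```python
def narrow_component_range(best_k, margin=2, lower_bound=1):
    """Return (min_n_components, max_n_components) bracketing an NMF estimate.

    best_k      -- number of signatures suggested by the fast NMF run
    margin      -- how many components to allow on either side (an assumption)
    lower_bound -- smallest allowed number of components
    """
    if best_k < lower_bound:
        raise ValueError("best_k must be at least lower_bound")
    return max(lower_bound, best_k - margin), best_k + margin


# Intended two-stage use (not executed here; attribute names are assumptions):
#   1. Fit DenovoSig with method='nmf' over the full range, e.g. 1 to 10.
#   2. Read the selected number of signatures from that model as best_k.
#   3. lo, hi = narrow_component_range(best_k)
#   4. Fit DenovoSig with method='mvnmf', min_n_components=lo, max_n_components=hi.
```

A wider margin is safer but costs more compute, since each additional n_components value is a separate set of mvNMF runs.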

Third, think about whether running signature discovery on the entire dataset is the optimal approach.
I don't know the details of your ~6000 samples. But for tumors, that number usually means a cohort composed of many tumor types. If that's the case, it's worth considering whether splitting the cohort is a more reasonable approach. For example, you can run signature discovery per tumor type. You can also try stratifying samples into distinct groups beforehand with musical.preprocessing.stratify_samples(), which can run directly on the input matrix.
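For the per-tumor-type option, the bookkeeping might look like the sketch below. It assumes you have a mapping from sample ID to tumor type; `group_samples_by_type` is a hypothetical helper, and it does not call musical.preprocessing.stratify_samples(), whose exact signature I am not reproducing here.

```python
from collections import defaultdict


def group_samples_by_type(sample_types):
    """Group sample IDs by tumor type.

    sample_types -- dict mapping sample ID -> tumor type label
    Returns a dict mapping tumor type -> list of sample IDs (input order kept).
    """
    groups = defaultdict(list)
    for sample_id, tumor_type in sample_types.items():
        groups[tumor_type].append(sample_id)
    return dict(groups)


# Each group's sample IDs can then be used to subset the columns of the
# input matrix, and signature discovery can be run once per subset.
```

Very small groups may not have enough samples for stable signature discovery, so it can be worth merging or skipping them.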

Hope this helps!

Best,
Hu

from musical.

Hu-JIN commented on August 19, 2024

Let me know if there are any further questions. Closing this issue for now.
