
De novo signature discovery - very slow

parklab commented on August 19, 2024

De novo signature discovery - very slow


Hu-JIN commented on August 19, 2024

Hi Maia,

Thank you for raising the issue. There are several things to consider when running signature discovery on a large number of samples (~6000 in your case). But before I elaborate on those, it's good to make sure that parallel calculation is indeed working on your machine:

1. Make sure there are at least 20 CPUs on the compute node.
2. Compare the compute time for -n 20 -c 1, -n 1 -c 20, and -n 1 -c 1 (with ncpu=1) on a small test example. Make sure that -n 20 -c 1 is indeed the preferred setting on your HPC and that you see the expected speedup relative to the serial job. This is worth checking because systems can differ.
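The comparison above can be scripted with a small timing harness. This is only a sketch: time_run, speedup, and run_small_test are hypothetical helper names, X_small stands for whatever small test matrix you choose, and the DenovoSig arguments follow the usage shown in the MuSiCal README. The -n and -c settings are still chosen when you submit the job, outside this script.

```python
import time


def time_run(run, ncpu):
    """Return the wall-clock seconds taken by run(ncpu)."""
    start = time.perf_counter()
    run(ncpu)
    return time.perf_counter() - start


def speedup(serial_seconds, parallel_seconds):
    """Speedup of the parallel run relative to the serial run."""
    return serial_seconds / parallel_seconds


def run_small_test(X, ncpu):
    """Hypothetical small discovery run on a small mutation count matrix X."""
    import musical  # imported here so the timing helpers work without it

    model = musical.DenovoSig(
        X,
        min_n_components=1,
        max_n_components=3,  # kept small so the test finishes quickly
        method="nmf",
        n_replicates=20,
        ncpu=ncpu,
    )
    model.fit()
    return model


# Example usage (X_small is a placeholder for your own small test matrix):
# t_serial = time_run(lambda n: run_small_test(X_small, n), 1)
# t_parallel = time_run(lambda n: run_small_test(X_small, n), 20)
# print(speedup(t_serial, t_parallel))
```

If the speedup with 20 CPUs is close to 1, parallel calculation is probably not working as intended on that node.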

Now, things to consider when running signature discovery on a large number of samples.

First, try changing the default parameters to speed up the calculation.
The default parameters in DenovoSig are set quite conservatively to ensure that the optimization algorithm converges. Empirically, though, we have seen that more aggressive parameters still give fairly good results, so it makes sense to use them when scaling up to a large number of samples (e.g., many thousands). I've just tested the following parameters on ~6300 TCGA WES samples, and the discovered SBS signatures looked high quality to me: min_iter=1000, max_iter=10000, conv_test_freq=100, tol=1e-5, n_replicates=20, mvnmf_lambda_tilde_grid=np.array([1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1.0]). These parameters 1) terminate the optimization earlier and 2) use a coarser grid for selecting the best regularization parameter. In my test with ~6300 samples, the run took 2 hr 10 min with 20 CPUs on our HPC, with min_n_components=1 and max_n_components=10. You can always move these parameters back toward the defaults later to check whether the results change in any major way.
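Put together, a run with these settings might look like the sketch below. The parameter values are copied from the comment above. The function name run_fast_mvnmf is hypothetical, and I am assuming that all of these parameters are passed as keyword arguments to the DenovoSig constructor and that method='mvnmf' is used, so please check against your installed version.

```python
# Values copied from the comment above: one tested setting, not a
# recommendation for every dataset.
LAMBDA_TILDE_GRID = [1e-10, 1e-9, 1e-8, 1e-7, 1e-6, 1e-5,
                     1e-4, 1e-3, 1e-2, 1e-1, 1.0]

FAST_PARAMS = {
    "min_iter": 1000,
    "max_iter": 10000,
    "conv_test_freq": 100,
    "tol": 1e-5,
    "n_replicates": 20,
}


def run_fast_mvnmf(X, ncpu=20, min_n_components=1, max_n_components=10):
    """Hypothetical wrapper: de novo discovery with the faster settings."""
    import numpy as np
    import musical

    model = musical.DenovoSig(
        X,
        min_n_components=min_n_components,
        max_n_components=max_n_components,
        method="mvnmf",  # assumed; the lambda grid only matters for mvNMF
        ncpu=ncpu,
        mvnmf_lambda_tilde_grid=np.array(LAMBDA_TILDE_GRID),
        **FAST_PARAMS,  # assumed to be constructor keyword arguments
    )
    model.fit()
    return model
```

Keeping the settings in one dictionary makes it easy to move individual values back toward the defaults and rerun.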

Second, try using NMF results to inform mvNMF runs.
DenovoSig with NMF (method='nmf') is fast: for example, with default parameters, it took only 40 min for the dataset above with ~6300 samples (again with 20 CPUs). This run will give you the best estimate of n_components (i.e., the number of signatures), or at least a plausible range for it. Then, you can run DenovoSig again, this time with mvNMF (method='mvnmf'), while setting min_n_components and max_n_components around that estimate. This procedure saves computation time by narrowing the search space for n_components.
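A minimal sketch of this two-stage procedure follows. The helper names narrowed_range and nmf_then_mvnmf are hypothetical, and the margin of 2 around the NMF estimate is an arbitrary choice for illustration. I am also assuming that the fitted model exposes the selected number of signatures as model.n_components.

```python
def narrowed_range(n_estimate, margin=2, lower_bound=1):
    """Return (min_n_components, max_n_components) around an estimate."""
    lo = max(lower_bound, n_estimate - margin)
    hi = n_estimate + margin
    return lo, hi


def nmf_then_mvnmf(X, ncpu=20, margin=2):
    """Hypothetical two-stage run: fast NMF first, then a narrowed mvNMF."""
    import musical

    # Stage 1: NMF over the full range to estimate the number of signatures.
    nmf_model = musical.DenovoSig(
        X, min_n_components=1, max_n_components=10, method="nmf", ncpu=ncpu
    )
    nmf_model.fit()
    n_estimate = nmf_model.n_components  # assumed attribute name

    # Stage 2: mvNMF restricted to a small window around that estimate.
    lo, hi = narrowed_range(n_estimate, margin)
    mvnmf_model = musical.DenovoSig(
        X, min_n_components=lo, max_n_components=hi, method="mvnmf", ncpu=ncpu
    )
    mvnmf_model.fit()
    return nmf_model, mvnmf_model
```

For example, if NMF suggests 7 signatures, the mvNMF run would only search n_components from 5 to 9 instead of 1 to 10.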

Third, think about whether running signature discovery on the entire dataset is the optimal approach.
I don't know the details of your ~6000 samples. But for tumors, that number usually means a cohort composed of many tumor types. If that's the case, it's worth considering whether splitting the cohort is a more reasonable approach. For example, you can run signature discovery separately for each tumor type. You can also try stratifying samples into distinct groups beforehand with musical.preprocessing.stratify_samples(), which can run directly on the input matrix.
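For the per-tumor-type option, the splitting step itself needs nothing from MuSiCal. Here is a small sketch, assuming you have a mapping from sample ID to tumor type; the sample IDs, tumor type labels, and the function name group_samples_by_type are made up for illustration. I am not showing stratify_samples() here because its arguments are not described in this thread.

```python
from collections import defaultdict


def group_samples_by_type(sample_to_type):
    """Group sample IDs by tumor type, keeping the input order."""
    groups = defaultdict(list)
    for sample, tumor_type in sample_to_type.items():
        groups[tumor_type].append(sample)
    return dict(groups)


# Hypothetical sample annotations.
sample_to_type = {"S1": "BRCA", "S2": "LUAD", "S3": "BRCA", "S4": "SKCM"}
groups = group_samples_by_type(sample_to_type)
# groups now maps each tumor type to the list of its sample IDs.
```

If your input matrix is a pandas DataFrame with samples as columns (an assumption about your setup), X[groups["BRCA"]] would then select the sub-matrix for one tumor type, and each sub-matrix can go through its own discovery run.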

Hope this helps!

Best,
Hu


Hu-JIN commented on August 19, 2024

Let me know if there are any further questions. Closing this issue for now.
