parklab / musical
A comprehensive toolkit for mutational signature analysis
License: Other
See 6f5dfe0
When n_components is over-specified, mvNMF tends to produce rank-deficient solutions, i.e., solutions with identical signatures. We can use this behavior to select n_components. We need to run tests comparing this method with the current method based on sample-wise reconstruction errors.
For an example, see #19
The current code is too slow.
Two small related issues with multiple potential solutions: I tried using the remove_samples_based_on_gini method after performing an extraction, with H from the DenovoSig class and my mutation count matrix X (loaded as a pandas DataFrame per the example_full_pipeline.ipynb) as inputs. This resulted in a NotImplementedError, which I traced back to the
mad = np.abs(np.subtract.outer(x, x)).mean()
line in the gini method of preprocessing.py. The problem seems to be that pandas no longer supports the outer method on DataFrames or Series. Converting the DataFrame to a numpy array fixed the problem, so I think clearer documentation about the expected input data types and/or a conversion of DataFrames to arrays within the method would help. Similarly, if you pass H as a DataFrame rather than an array, you get a TypeError for the
h_norm = h/np.sum(X, axis = 0)
line in the remove_samples_based_on_gini method, since that operation isn't supported on DataFrames. The same fixes would work there!
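A minimal sketch of the fix proposed above: coerce pandas inputs to numpy arrays at the top of the gini computation so np.subtract.outer works. This is an illustrative reimplementation of the mean-absolute-difference Gini formula quoted from preprocessing.py, not MuSiCal's actual code.

```python
import numpy as np
import pandas as pd

def gini(x):
    """Gini coefficient of a 1-D array via the mean absolute difference.

    Coercing to a numpy array up front lets pandas Series pass through
    without hitting NotImplementedError on np.subtract.outer.
    """
    x = np.asarray(x, dtype=float)  # DataFrame column / Series -> ndarray
    mad = np.abs(np.subtract.outer(x, x)).mean()  # mean absolute difference
    rmad = mad / np.mean(x)  # relative mean absolute difference
    return 0.5 * rmad

# Works with a pandas Series input now:
gini(pd.Series([1.0, 2.0, 3.0, 4.0]))
```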
We should follow more closely the design of sklearn estimators: https://scikit-learn.org/stable/developers/develop.html.
For example, __init__ should only store parameters, and all computation should be moved from __init__ to fit(). Ideally, we should also do this for other classes, e.g., MVNMF, NMF, etc.
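A short sketch of the sklearn estimator convention referenced above: __init__ only stores hyperparameters, all work happens in fit(), and fitted attributes get a trailing underscore. The class and parameter names here (MyNMF, n_components, max_iter) are illustrative, not MuSiCal's actual API.

```python
import numpy as np

class MyNMF:
    def __init__(self, n_components=2, max_iter=100):
        # No validation or computation here -- just store parameters,
        # per the sklearn developer guidelines.
        self.n_components = n_components
        self.max_iter = max_iter

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        rng = np.random.default_rng(0)
        n, m = X.shape
        # Fitted attributes carry a trailing underscore.
        self.W_ = rng.random((n, self.n_components))
        self.H_ = rng.random((self.n_components, m))
        for _ in range(self.max_iter):
            # Standard multiplicative updates for the Frobenius objective.
            self.H_ *= (self.W_.T @ X) / (self.W_.T @ self.W_ @ self.H_ + 1e-12)
            self.W_ *= (X @ self.H_.T) / (self.W_ @ self.H_ @ self.H_.T + 1e-12)
        return self
```

With this layout, clone-and-refit patterns like grid search work out of the box, because constructing the object is cheap and side-effect free.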
Currently, we first filter out n_components values that do not have stable solutions, then calculate a p-value for sample-wise reconstruction errors, and finally choose n_components based on those p-values.
Some other pieces of information could be used for automatically selecting the best n_components.
Hi.
MuSiCal is a better tool than SigProfiler, but it would be even better to develop an R version of MuSiCal.
If you develop one, please let me know. Thank you very much.
Currently this test is only used in filtering out bad solutions. We should also use it in selecting n_components, in combination with a simple test for distribution means, which we currently use.
There are cases where a cluster contains only 1 sample. In those cases, the sil_score for that signature is currently 0. It might be better to set the sil_score to 1 in those cases.
Example: I ran DenovoSig with mvNMF on simulated PCAWG Lung-AdenoCA data, with init=random and n_replicates=20. SBS3 is difficult to discover. It turns out that in 19 of the 20 replicates, two SBS4's are extracted, while in the remaining one, an SBS3 is discovered. So in the final clustering result, there is an SBS3 cluster with 1 sample and an SBS4 cluster with 39 samples.
This example also illustrates why it may not be good to force each cluster to contain one and only one sample from each replicate, as done in SigProfiler.
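The proposed change (singleton clusters scoring 1 instead of 0) could be sketched as a small wrapper around scikit-learn's per-sample silhouette scores. This is an illustrative helper, not MuSiCal's implementation; the function name is hypothetical.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def per_cluster_silhouette(X, labels):
    """Mean silhouette score per cluster.

    sklearn assigns a silhouette of 0 to samples in singleton clusters;
    following the suggestion above, we overwrite singleton clusters
    with a score of 1 instead.
    """
    labels = np.asarray(labels)
    scores = silhouette_samples(X, labels)
    out = {}
    for lab in np.unique(labels):
        mask = labels == lab
        out[lab] = 1.0 if mask.sum() == 1 else float(scores[mask].mean())
    return out
```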
Hello!
I followed the instructions for "Install from Source" in your README (I also checked to see if there was a recipe to install the package from conda using conda search *musical*
and via the bioconda site, and didn't find anything, which is why I went with install from source).
I am able to import MuSiCal in Python but when I run musical.load_catalog()
I get the following error message
FileNotFoundError: [Errno 2] No such file or directory: '/opt/miniconda3/envs/{MYENVNAME}/lib/python3.10/site-packages/musical/data/COSMIC-MuSiCal_v3p2_SBS_WGS.csv'
The same error persists both in Jupyter and when invoking Python in the console, so it shouldn't be a kernel issue.
Inside the .../site-packages/musical/data/
directory there was only a __pycache__
directory and one py script, no further files.
Can you please advise on how to fix this issue? Possibly there was a problem with my installation? Though I used pip inside my conda environment and there were no issues reported. All dependencies were previously installed via conda.
The current parallelization of DenovoSig using multiprocessing works fine. Note that it is important to set sbatch -n to be ncpu in DenovoSig, instead of sbatch -c.
When sbatch -n is set to be ncpu in DenovoSig (with -c set to 1), I checked that the running time of each job performed by each worker is almost the same as that of a serial job. So we can almost get ncpu times speedup.
When sbatch -c is set to be ncpu in DenovoSig (with -n set to 1), the running time of each job performed by each worker is about ncpu times slower than that of a serial job. As a result, we don't get any speedup.
So it is important to use sbatch -n instead of sbatch -c. That contradicts what I previously understood about sbatch, though. We need to understand this behavior better.
Note that the results stated above are the same for NMF and mvNMF.
There could be potential improvements of our parallelization scheme. It is possible that some time is spent on pickling objects. See https://thelaziestprogrammer.com/python/a-multiprocessing-pool-pickle. Although, from the result above, there does not seem to be much overhead in our current code.
Write a Cython version of the NMF and mvNMF code. That should further reduce running time.
When the reconstruction is precise (e.g., in simulations), the output of beta_divergence(beta=1) can be negative sometimes (very small values, e.g., -1e-14), due to numerical errors. It does not really matter, but is worth pointing out.
This rarely happens. But in some cases, say we have x = [0, 0, 0], and y = [0, 0, 0], stats.mannwhitneyu(x, y) will throw an error. Let's write a wrapper to take care of that. This is similar to what we've already done in differential_tail_test().
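A sketch of such a wrapper, assuming the degenerate case we care about is both samples being entirely constant and equal (as in the x = [0, 0, 0], y = [0, 0, 0] example). The function name is hypothetical; this is not MuSiCal's actual helper.

```python
import numpy as np
from scipy import stats

def mannwhitneyu_safe(x, y, **kwargs):
    """Wrapper around stats.mannwhitneyu that tolerates degenerate input.

    If every value in x and y is identical, the test statistic is
    undefined (SciPy may raise ValueError); report p = 1.0 instead,
    since there is no evidence of a difference between the samples.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    pooled = np.concatenate([x, y])
    if np.all(pooled == pooled[0]):
        # Degenerate case: all values identical across both samples.
        return np.nan, 1.0
    try:
        res = stats.mannwhitneyu(x, y, **kwargs)
        return res[0], res[1]
    except ValueError:
        # Older SciPy versions raise here for all-identical values.
        return np.nan, 1.0
```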
n_replicates needs to be either 1, which means no replicates will be done, or >= 10. We need to specify this somewhere to let the users know.
Of course for other n_replicates values, the code will still run and generally there won't be an issue. Setting n_replicates >= 10 just makes more sense with the default thresholds in _select_n_components(). See caf9254
Hello,
I am trying to use the refitting example, however, I cannot find "musical.refit" in the package. Should I use "musical.assign" or "musical.assign_grid"?
Thank you!
Check what the definition should be when there is 1 sample in the cluster, or just 1 cluster. Modify the code in _gather_results() accordingly.
I am trying to stratify my cohort and get an error during the following step:
k, clusters, Xs, model = musical.preprocessing.stratify_samples(X.values, H.values)
This is the error I receive:
MuSiCal/MuSiCal-main/musical/cluster.py:181: RuntimeWarning: divide by zero encountered in log
self.Wk_log = np.log(self.Wk)
MuSiCal/MuSiCal-main/musical/cluster.py:195: RuntimeWarning: divide by zero encountered in log
self.Wk_log_ref_all = np.log(self.Wk_ref_all)
python37_musical/lib/python3.11/site-packages/numpy/core/_methods.py:173: RuntimeWarning: invalid value encountered in subtract
x = asanyarray(arr - arrmean)
MuSiCal/MuSiCal-main/musical/cluster.py:206: RuntimeWarning: invalid value encountered in subtract
self.gap_statistic_log = self.Wk_log_ref - self.Wk_log
Here is the beginning of my H.values:
array([[ 13.02370177, 22.36663251, 4.58321832, 2.30586172,
6.80032626, 7.65084841, 20.57174326, 13.90007932,
29.7988945 , 0.87010952, 11.40108359, 37.65595118,
3.78449284, 9.85849575, 8.67073113, 4.58633285],
And the beginning of X.values:
array([[8, 3, 3, ..., 0, 0, 1],
This is WXS data by the way. Thanks for your help!
Hi!
Thank you for developing this tool! I am currently using it for signature refitting and decomposition, but I'd like to also take advantage of the signature discovery module. However, I've been having some issues with runtime. When trying to extract signatures from a matrix with ~6000 samples, a 24h job only gets through 1 signature, which is very slow compared to other tools, and I'm wondering if I'm doing something wrong when assigning resources to the job.
These are the slurm options used (from the documentation I understand that the ncpu value should match -n, not -c):
#SBATCH -J musical
#SBATCH -n 20
#SBATCH -c 1
#SBATCH -t 24:00:00
#SBATCH --mem 20G
conda activate python37_musical
python3 Musical.py
And this is the MuSiCal script:
model = musical.DenovoSig(X,
min_n_components=1, # Minimum number of signatures to test
max_n_components=10, # Maximum number of signatures to test
init='random', # Initialization method
method='mvnmf', # mvnmf or nmf
n_replicates=20, # Number of mvnmf/nmf replicates to run per n_components
ncpu=20, # Number of CPUs to use
max_iter=100000, # Maximum number of iterations for each mvnmf/nmf run
bootstrap=True, # Whether or not to bootstrap X for each run
tol=1e-8, # Tolerance for claiming convergence of mvnmf/nmf
verbose=1, # Verbosity of output
normalize_X=False # Whether or not to L1 normalize each sample in X before mvnmf/nmf
)
model.fit()
And the musical logs:
Extracting signatures for n_components = 1..................
Selected lambda_tilde = 2. This lambda_tilde will be used for all subsequent mvNMF runs.
/home/mmunteanu/.conda/envs/python37_musical/lib/python3.7/site-packages/musical/mvnmf.py:509: UserWarning: No p-value is smaller than or equal to 0.05. The largest lambda_tilde is selected. Enlarge the search grid of lambda_tilde.
UserWarning)
n_components = 1, replicate 14 finished.
n_components = 1, replicate 10 finished.
n_components = 1, replicate 9 finished.
n_components = 1, replicate 18 finished.
n_components = 1, replicate 2 finished.
n_components = 1, replicate 1 finished.
n_components = 1, replicate 19 finished.
n_components = 1, replicate 7 finished.
n_components = 1, replicate 12 finished.
n_components = 1, replicate 15 finished.
n_components = 1, replicate 8 finished.
n_components = 1, replicate 6 finished.
n_components = 1, replicate 13 finished.
n_components = 1, replicate 11 finished.
n_components = 1, replicate 17 finished.
n_components = 1, replicate 3 finished.
n_components = 1, replicate 16 finished.
n_components = 1, replicate 4 finished.
n_components = 1, replicate 5 finished.
Time elapsed: 2.79e+04 seconds.
Extracting signatures for n_components = 2..................
Selected lambda_tilde = 0.1. This lambda_tilde will be used for all subsequent mvNMF runs.
Many thanks,
Maia
When the selected lambda_tilde in MVNMF is at the edge of the grid, we should give a warning suggesting that the grid be enlarged.
Currently, the output W of NMF.fit() is not normalized. It's better to normalize W at the end of fitting. I think the output W of MVNMF.fit() is always normalized, but we need to double check.
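The normalization itself is straightforward: rescale each signature (column of W) to sum to 1 and fold the scale factors into H so that the product W @ H is unchanged. A minimal sketch, with illustrative names:

```python
import numpy as np

def normalize_WH(W, H):
    """L1-normalize each column of W, compensating in H.

    Each column of the returned W sums to 1, and W @ H is preserved
    exactly, so exposures in H absorb the signature scales.
    """
    scale = W.sum(axis=0)          # per-signature column sums
    return W / scale, H * scale[:, None]
```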
We perform NNLS frequently in the code, e.g., at the end of NMF and mvNMF, and after _gather_results(). But we still use KL divergence (i.e., beta divergence with beta = 1) for selection of lambda_tilde in mvNMF, selection of n_components, and filtering of bad solutions in _gather_results(). That is a bit inconsistent.
We can consider switching to beta divergence with beta = 2 once we have done NNLS.
In fact, it is never consistent because we do NMF/mvNMF with KL divergence and NNLS for inferring exposures. So no matter which norm we choose in the downstream selections, there will be inconsistencies.
I ran into this issue when I was looking at sample-wise errors. Sometimes the reconstructed spectrum looks very similar to the sample spectrum by eye, but the KL divergence for that sample is relatively large. I think this is because NNLS optimizes the Frobenius norm. If we used a KL-divergence version of NNLS, that problem should disappear.
So, maybe instead of doing NNLS, we can do a KL-divergence version of NNLS.
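A KL-divergence version of NNLS can be obtained by fixing W and iterating only the H-update of the standard KL-NMF multiplicative algorithm (Lee & Seung). The sketch below is one possible implementation under that assumption, not MuSiCal's code; the function name is illustrative.

```python
import numpy as np

def nnls_kl(W, x, n_iter=500, eps=1e-12):
    """Solve min_h KL(x || W h) subject to h >= 0.

    Uses the fixed-W multiplicative update from KL-NMF: each iteration
    monotonically decreases the KL divergence, and non-negativity is
    preserved automatically because updates are multiplicative.
    """
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    h = np.full(W.shape[1], x.sum() / W.shape[1])  # uniform initialization
    w_colsum = W.sum(axis=0)
    for _ in range(n_iter):
        recon = W @ h + eps                 # current reconstruction
        h *= (W.T @ (x / recon)) / (w_colsum + eps)
    return h
```

For exposure inference this would replace scipy.optimize.nnls, trading the exact Frobenius solution for a KL-consistent one, which matches the objective used by the NMF/mvNMF extraction itself.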
Keep track of sample names during calculations.
See title.
Hello! Thank you for sharing MuSiCal. I have been able to extract de novo signatures which was really exciting, but I am now having trouble with the part of the example full pipeline where Parameter optimization with in silico validation is meant to take place. I would love your help so that I can match to the right number of signatures. Thank you for your time.
After matching and refitting, we can look at the reconstruction errors per sample. And for those samples with large reconstruction errors, we can look at their error spectrum. That spectrum informs which signatures are not included in the refitting and should have been. We can go on adding those signatures. This mechanism basically allows more signatures to be considered in the refitting process (e.g., signatures that are not discovered in this tumor type). (Suggestion from Doga)
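The per-sample diagnostic described above can be sketched as follows: compute the residual between each sample's observed and reconstructed spectrum, then inspect the residuals of the worst-fit samples for the shape of a missing signature. Names here are illustrative.

```python
import numpy as np

def error_spectra(X, W, H):
    """Per-sample reconstruction errors and residual (error) spectra.

    Columns of `residual` are the error spectra; a structured residual
    in a poorly fit sample suggests a signature missing from the refit.
    """
    X = np.asarray(X, dtype=float)
    recon = W @ H
    residual = X - recon                        # error spectrum per sample
    errors = np.linalg.norm(residual, axis=0)   # per-sample error magnitude
    return errors, residual
```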
Update to COSMIC v3.2.